SODAR: managing multiomics study data and metadata

Abstract
Scientists employing omics in life science studies face challenges such as the modeling of multiassay studies, recording of all relevant parameters, and managing many samples with their metadata. They must manage many large files that are the results of the assays or subsequent computation. Users with diverse backgrounds, ranging from computational scientists to wet-lab scientists, have dissimilar needs when it comes to data access, with programmatic interfaces being favored by the former and graphical ones by the latter. We introduce SODAR, the system for omics data access and retrieval. SODAR is a software package that addresses these challenges by providing a web-based graphical user interface for managing multiassay studies and describing them using the ISA (Investigation, Study, Assay) data model and the ISA-Tab file format. Data storage is handled using the iRODS data management system, which handles large quantities of files and substantial amounts of data. SODAR also offers programmable APIs and command-line access for metadata and file storage. SODAR supports complex omics integration studies and can be easily installed. The software is written in Python 3 and freely available at https://github.com/bihealth/sodar-server under the MIT license.


Introduction
Modern studies in the life sciences often involve running multiple "omics" assays (e.g., genomics, proteomics, and metabolomics). Such studies require careful planning, data collection, data analysis, and integration of data. An example of such a complex study is the infection biology study by Esterhuyse et al. (2015), which will be used as an example below. Ideally, scientists are supported by a detailed modeling of each of the involved steps to keep track of the status of data acquisition as well as relevant factors and confounders.
The most comprehensive standard for describing study metadata is the ISA-Tab format (Sansone et al., 2012), which allows modeling studies with multiple samples and assays. ISA-Tab is a tabular file format that allows users to model each processing step with each intermediate result and annotate each of these with arbitrary metadata. Alternatives include Portable Encapsulated Projects (PEP) by Sheffield et al. (2021).
Another characteristic of modern omics studies is that they generate large volumes of data, ranging from a few gigabytes for mass spectrometry to tens of gigabytes for genomics sequencing to terabytes for imaging/microscopy. Such data sets must be managed with respect to the intrinsic complexity of their structure and metadata, the generated raw data, and the results of subsequent computational processing. In the simplest case, data can just be stored using file systems or object storage systems. More advanced systems such as Shock (Bischof et al., 2015) or dCache (Ernst et al., 2001) also allow for storing metadata and distributing the data to a "data grid" over multiple servers. iRODS (Hedges et al., 2009; Chiang et al., 2011) adds even further features such as running programs within the data system and providing integration with arbitrary authentication systems.
Once published, multi-omics study data is often deposited in public data portals such as the BioSamples database (Courtot et al., 2022). However, before completion and publication, researchers need to capture experiment and sample metadata as well as the generated mass data in private systems.
Systems for capturing data and experiment metadata include ELNs (electronic laboratory notebooks; cf. Higgins et al., 2022). Raw experiment data is commonly stored in LIMS (laboratory information management systems). Few dedicated systems for storing both omics mass data and metadata have been published; one example is qPortal (Mohr et al., 2018; Cuellar et al., 2022), which is itself based on OpenBIS (Bauch et al., 2011).
In this manuscript, we introduce SODAR (the system for omics data access and retrieval). SODAR combines the modeling of studies and assays using the ISA-Tab standard with handling of mass data storage using iRODS (integrated rules-oriented data system). We demonstrate the features of SODAR with a multi-omics use case. More example projects are available in the SODAR online demo at https://sodar-demo.cubi.bihealth.org.

Results
We first describe the SODAR system and then compare it qualitatively with similar software. Next, we describe the overall process of using SODAR, provide a multi-omics use case, and show how the abstract steps of that process are realized in it.
Figure 1 shows the components of the SODAR system. SODAR Server contains the main system logic, providing a web-based user interface (UI) and REST APIs for managing metadata. Mass data storage is implemented using iRODS, and the common WebDAV protocol is provided by Davrods. Non-computational users can interface with SODAR using the graphical UI, whereas computational users can use command line interfaces and REST APIs from scripts and other external software.

Comparison of Features
Our motivation for developing SODAR was the lack of an appropriate system that serves both for modeling experiments and for storing metadata and mass data. We collected systems with similar features in Table 1 and compared them to SODAR using categories for key SODAR features.

Sample/Process Annotation
Samples (and analytes) as well as processes should be annotated with information that captures relevant parameters and measures.

Ontology Support
A key feature for capturing information in a structured way is the support of ontologies.

Arbitrary Experiments
It is important whether systems allow the modeling of arbitrary experiments or are limited to certain kinds (e.g., LIMS often are flexible in principle but are usually configured for streamlined wet-lab processes).

Multi-Omics Experiments
Relatedly, SODAR allows for capturing that the same set of samples has been subjected to different experiments or assays ("multi-omics support").
Large file support is another key feature and enables storing mass data for the experiment results.

Custom Installation
Finally, we captured whether a custom installation using on-premises hardware is possible with the different systems.
REST APIs in data management systems enable many powerful applications such as integration with existing database systems or integration into third-party client software.
All these features are supported by SODAR, making it a unique framework for supporting users in multiomics data management. Importantly, SODAR provides REST APIs and hyperlinks into its metadata and mass data repositories and thus can be easily integrated with other systems. Note that each of the systems considered caters to a different niche and SODAR is not meant to replace any of the other software packages or outclass them in their niche.

General SODAR Process
The general workflow in using SODAR for managing data and metadata is shown in Figure 2. We distinguish between the roles "data steward" and "experimentalist"; however, in some cases one person might have both roles. The former are responsible for creating the overall structure of the data, while the latter are responsible for entering the actual data into the system.
Data stewards are users who are experienced with using ISA-Tab files and in our use case are bioinformaticians working in the core unit. They are responsible for modeling the experiments in the ISA-Tab format as "sample sheets" with the overall experimental design. They generally also maintain a library of sample sheet templates for common use cases. With experienced experimentalists, the steward might just create the general structure of the experiment. When necessary, the steward might pre-create the sample sheet with a full skeleton of all planned samples, processes, and IDs together with the experimentalists.
Experimentalists are users who are more concerned with completing the data in the sample sheet than with creating its structure. When the full sample sheet is created together with the stewards, they might only verify the structure against the information on their experiments (stored in an ELN, for example) and fill in some measurements in sample sheet cells (e.g., concentration measurements).
More trained and experienced experimentalists will also create new rows for samples.
Of course, using the REST API of SODAR it is possible to automate all these tasks. For example, an integration with a LIMS system could automatically create samples as they are processed in the wet-lab while measurements could be written to SODAR from the LIMS or from an integration of an ELN system.
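As a minimal sketch of such an integration, the following Python snippet builds an authenticated JSON request without sending it. The server URL, the endpoint path, the token value, and the "token" authorization scheme are assumptions made for illustration; they should be checked against the API documentation of the SODAR installation in use.

```python
import json
import urllib.request


def build_sodar_request(base_url, path, token, payload=None):
    """Build (but do not send) an authenticated JSON request."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=data,
        method="GET" if payload is None else "POST",
    )
    # Assumption: the server accepts an API token in the Authorization header.
    request.add_header("Authorization", f"token {token}")
    request.add_header("Accept", "application/json")
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


# Hypothetical server URL, endpoint path, and token, for illustration only.
request = build_sodar_request(
    "https://sodar.example.org", "/project/api/list", token="my-secret-token"
)
# Sending the request would then be: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it straightforward to test such an integration without access to a running server.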
SODAR provides a set of templates for common experiment types, but users can also use external software such as ISA-tools (Sansone et al., 2012).
Figure 2: General workflow in using SODAR, split into steps attributed to a data steward (blue), who manages the overall data schema, and an experimental user (green), who enters the actual data or uploads files.
In general, there are two types of data in the SODAR system. The metadata, which includes information about the samples, procedures, analysis, experimental scheme, etc., is stored in the ISA-Tab format and can be edited within the SODAR system using a GUI or manipulated with the REST API. In contrast, the experimental data (the results of the measurements, for example FASTQ files for sequencing or mass spectrometry XML files) can be uploaded to SODAR using a separate two-stage GUI or using the REST API.
For each experiment, SODAR manages a corresponding directory in the iRODS data repository. Each sample/analyte is associated with a subdirectory in this repository and users can upload data for individual samples/analytes or whole experiments. To enforce checking the integrity and correctness of the data, users do not manipulate data directly in the data repository. Instead, they first upload data to "landing zones" where they have full read-write access. SODAR enforces certain best practices such as requiring a checksum for each uploaded file. Once the uploading user is content with the uploaded data, they submit the landing zone for import into the project data repository. SODAR checks the uploaded data and moves it to the project data repository, where it is immutable for general users. Users can submit requests for deleting data, which have to be confirmed by a project owner or delegate.
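The checksum requirement can be illustrated with a short Python sketch that writes a checksum file next to each data file before upload and verifies it afterwards. The use of MD5 and of a ".md5" sidecar file are assumptions for this example; the function names are hypothetical and this is not SODAR's own validation code.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_file(path):
    """Write '<name>.md5' next to the file (assumed sidecar convention)."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".md5")
    sidecar.write_text(f"{md5_of(path)}  {path.name}\n")
    return sidecar


def verify_checksum_file(path):
    """Return True only if a sidecar exists and matches the file content."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".md5")
    if not sidecar.exists():
        return False
    fields = sidecar.read_text().split()
    return bool(fields) and fields[0] == md5_of(path)
```

A file without a sidecar, or whose content changed after the checksum was written, fails verification, which mirrors the idea of rejecting a landing zone before its data reaches the read-only repository.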
As mentioned above, SODAR associates the data in the iRODS project data repository with the samples and materials based on the directory names in iRODS. Users can easily access any file in the projects that they have access to via the SODAR UI, WebDAV which allows mounting the storage on their desktop machine, or the iRODS protocol and command line tools. The metadata can be exported to ISA-Tab files.
Uploading ISA-Tab files is also allowed if the user has existing files or wants to use external applications for editing them. We further added special support for the IGV genome browser.
To conclude, SODAR supports computational and experimental users with functionality to model their experiments, upload the resulting files, and access the files through effective and easy-to-use means.

Use Case Description
In the following sections, we show how SODAR can be used to support multi-omics studies, using the study by Esterhuyse et al. (2015) as an example. To be clear, this study was originally not performed using SODAR. We will illustrate the modeling, data import, and upload steps. In that study, blood was collected from each patient and then subjected to DNA methylation and transcription analysis using microarrays, and proteomics analysis was performed using mass spectrometry. The resulting data was then subjected to statistical analysis and led to the published article.

Modeling
When applying SODAR, our group of bioinformaticians and biostatisticians meets with the experimentalists and discusses sample sizes and suitable assay types. To simplify the description, we describe the people with the computational/statistical knowledge as having the "biostatistician" role. In the case of the given study, we would decide on the sample types together with the experimentalists, taking into account the availability of funding and donors. Our work focuses on commonly used assays, mainly NGS-based ones, but also includes certain proteomics and metabolomics assays used by labs that we collaborate with regularly.
In the case of the TB study, a member of our group would take the role of the data steward and we would first create the sample sheet structure of the blood sampling itself. In ISA terminology, the blood donors are "sources" while the collected blood is "samples". The data steward would define relevant source factors (e.g., acute or latent infection) and important confounders (such as age) together with the biostatistician and the experimentalists. The biostatisticians are generally trained as data stewards and thus have both roles, but they might also talk to other data stewards in case of questions.
We would then continue to model the relevant parts of the experiment. Important modeled experiment steps may be the extraction of analytes such as RNA, the measurements themselves, and the vendor software and version used for the primary data analysis. Important properties of the occurring analytes ("materials" in ISA terminology) and processes include RNA concentrations, the used microarrays including lot numbers, as well as software versions.
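To make the tabular model more concrete, the following Python sketch reads a tiny tab-separated study table for the blood sampling step. The column headers follow the ISA-Tab naming style; the donor identifiers, ages, and factor values are invented for illustration and are not taken from the study.

```python
import csv
import io

# Invented example rows; only the column naming follows ISA-Tab conventions.
STUDY_TSV = (
    "Source Name\tCharacteristics[age]\tProtocol REF\t"
    "Sample Name\tFactor Value[infection]\n"
    "donor_01\t34\tblood collection\tdonor_01-blood\tacute\n"
    "donor_02\t51\tblood collection\tdonor_02-blood\tlatent\n"
)


def read_study_table(text):
    """Return the rows of a tab-separated study table as dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


rows = read_study_table(STUDY_TSV)
# Map each source (donor) to the sample derived from it.
samples_by_source = {row["Source Name"]: row["Sample Name"] for row in rows}
# Group sample names by the value of the infection factor.
samples_by_factor = {}
for row in rows:
    factor = row["Factor Value[infection]"]
    samples_by_factor.setdefault(factor, []).append(row["Sample Name"])
```

Each row links a source to a sample via a protocol, so factors and confounders recorded on the source remain attached to every material derived from it.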

Sample Metadata Definition/Data Entry
Data stewards use ontologies and controlled vocabularies where possible. This would be discussed with the experimentalists and suitable terms would be agreed upon together. In the case of ambiguity, it makes sense to use the same term for the same real-world object across projects to improve data reusability.
In the work of our group, we would generally fill one or two example rows with our collaborators but then hand it over to them to fill in the actual metadata. The resulting sample sheet would then be iteratively improved through review by the data stewards and further discussion with the experimentalists. Adjustments may include adding further columns (e.g., for measurements) to the sample sheet, adjustment of the used ontology terms, and adding or removing sample rows.
Of course, projects generally aim at having a stable plan of the work, but adjustments may be required over time. Common reasons for such adjustments are drop-outs during certain assays, additional measurements becoming required, or having to perform additional assays or use additional samples during review for publication. SODAR allows such adjustments to the data it handles over time and stores the sample sheet versions after each change. It also includes a tool for comparing sample sheets to inspect the performed changes.
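The idea of comparing two sample sheet versions can be sketched in Python using the standard difflib module. This is an illustration of row-level comparison on tab-separated text, not the comparison tool built into SODAR; the function name and the example rows are hypothetical.

```python
import difflib


def changed_rows(old_text, new_text):
    """Return (removed_rows, added_rows) between two sheet versions."""
    old_rows = old_text.splitlines()
    new_rows = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_rows, b=new_rows, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old_rows[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_rows[j1:j2])
    return removed, added


# Invented example: one measurement corrected and one sample row added.
version_1 = "Sample Name\tConcentration\ns1\t10\ns2\t12"
version_2 = "Sample Name\tConcentration\ns1\t10\ns2\t15\ns3\t9"
removed, added = changed_rows(version_1, version_2)
```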

Raw Data Import & Raw Data Access for Processing & Result Import after Processing
Eventually, the experimentalists perform the modeled wet-lab steps, measurements, and primary data analysis. The resulting raw data is imported into SODAR by first creating a landing zone, then uploading the data (done either by lab technicians or by us bioinformaticians), and finally moving the data from the editable landing zone to the read-only per-project data repository.
The biostatisticians obtain the metadata and mass data created by the experimentalists and download them using the web interface or command line interfaces/REST APIs. The data is analyzed as appropriate. Resulting data and reports are then uploaded to SODAR, again using landing zones. This is usually followed by a series of discussions with the experimental partner in which the analysis is refined, and the subsequently generated data files and reports are deposited in the read-only SODAR per-project data repository. Of course, such iterations might include having to update the sample sheets as described in Section 2.2.2 and subsequent reanalysis. If files in the read-only data repository are to be replaced, they have to be removed first. This can be done using a two-step/four-eyes process in which users create a deletion request that the project owner has to confirm.
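The two-step deletion process can be sketched as a small state transition in Python. The class, the status names, and the role strings are hypothetical and only illustrate the rule that a request stays pending until a project owner or delegate confirms it.

```python
from dataclasses import dataclass

# Roles allowed to confirm a deletion request, as described in the text.
CONFIRMING_ROLES = {"owner", "delegate"}


@dataclass
class DeletionRequest:
    """A request to delete a file from the read-only repository."""

    path: str
    requested_by: str
    status: str = "pending"

    def confirm(self, user, role):
        """Confirm the request; only an owner or delegate may do so."""
        if role not in CONFIRMING_ROLES:
            raise PermissionError(f"{user} ({role}) may not confirm deletions")
        self.status = "confirmed"


# Hypothetical file path and user names, for illustration only.
request = DeletionRequest(path="sample_01/raw.fastq", requested_by="alice")
request.confirm(user="bob", role="owner")
```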

Resulting Data Access
The read-only project data repository is intended for long-term storage. All data is available to the experimentalists in a self-service fashion such that they can re-use all data in subsequent studies or access the intermediate and final results to answer requests for data sharing or questions regarding their publication.

Internal Usage Statistics
In the spirit of "eating your own dog food", we have been using SODAR in our group's projects for the past four years. Table 2 gives summary statistics of data and metadata stored in our internal instance as well as the diversity of projects. We thus tested SODAR extensively in a real-world setting and use it daily as our main storage for all our project data and metadata.

Methods
SODAR is implemented in Python 3 using the Django web framework and Django REST Framework.
Reusable components have been extracted into the library SODAR Core. ISA-Tab format manipulation has been implemented using AltamISA (Kuhring et al., 2019).

Project Organisation, Authorization Structure, and LDAP Integration
SODAR uses the concept of "projects" for organizing all data. Projects have a unique identifier and some basic metadata such as title, description, etc. Projects can be organized in a tree structure using the concept of "categories" that can contain projects or other categories. Each project has a single owner, who can assign themselves a delegate for managing the project. Further users can be granted access to the project either in a read-write (contributor) or a read-only fashion (guest) using role-based access control.
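The role model described above can be summarized in a short Python sketch. The role names come from the text; the action names ("read", "write", "manage") and the function are hypothetical simplifications and do not reflect SODAR's actual permission rules.

```python
from enum import Enum


class Role(Enum):
    OWNER = "owner"
    DELEGATE = "delegate"
    CONTRIBUTOR = "contributor"
    GUEST = "guest"


# Hypothetical action names; the mapping follows the description in the text.
ALLOWED_ACTIONS = {
    Role.OWNER: {"read", "write", "manage"},
    Role.DELEGATE: {"read", "write", "manage"},
    Role.CONTRIBUTOR: {"read", "write"},
    Role.GUEST: {"read"},
}


def is_allowed(role, action):
    """Return True if the given role may perform the given action."""
    return action in ALLOWED_ACTIONS[role]
```

For example, `is_allowed(Role.GUEST, "write")` evaluates to `False`, while `is_allowed(Role.CONTRIBUTOR, "write")` evaluates to `True`.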

SODAR can be configured to be run standalone or integrated with LDAP servers (including Microsoft ActiveDirectory) for providing authentication information, where authentication refers to checking the identity of a user based on their username and password.

iRODS Integration
SODAR automatically manages user access to projects in iRODS. This is done by creating an iRODS directory and user group for each project. The group is given access to the directory and group membership is synchronized between the SODAR database and iRODS.
Further, SODAR creates a subdirectory for each study and assay from the ISA model of the project.
Users can use the landing zone mechanism for adding files for each sample/analyte or add them for the whole study or assay. Users can thus add data for an arbitrary number of assays for each sample and original donor or specimen.
The files can be accessed either directly through the iRODS protocol or using the WebDAV protocol through the Davrods (Smeele & Smeele, 2016) software. The latter allows users to access the storage as a network drive on their desktop computers. Since WebDAV is HTTP based, users can also make data available to genome browsers such as IGV or UCSC Genome Browser. Moreover, it is generally easy to access data through an organization's firewall and proxies without intervention of IT departments.
Optionally, SODAR allows the management of iRODS "tickets", which allow for access based on randomly generated tokens instead of user login. This way, users can upload genome browser tracks to SODAR/iRODS and create public URL strings to access them and share them with users that do not have access to the full project (or do not even have an account in SODAR).

Sample Sheet Editor, Import, Export
Sample sheets can be included into SODAR projects by either importing existing ISA-Tab files or template-based creation. A single project corresponds to an "investigation" in the ISA-Tab naming convention. When importing, the user can upload a Zip archive or a set of individual ISA-Tab files. For creating sample sheets from templates, the user needs to fill in certain details in the SODAR UI. SODAR provides multiple built-in templates for, e.g., generic RNA sequencing, germline DNA sequencing and mass spectrometry-based metabolomics. After import or creation, the sample sheets are stored in an object-based format in the SODAR database for easy search and modification. In the UI, they are presented to the user as spreadsheet-style study and assay tables.
The user can edit the cells of the study and assay tables in the SODAR UI (Sup. Figure 3). When editing sample sheets, old sheet versions are stored as backup. These versions can be compared and restored in case of mistakes, as well as exported from the system. SODAR allows for sample sheet export in the full ISA-Tab TSV format or as simplified Excel tables. Replacing existing sheets with versions modified outside of SODAR is also supported.

Integrating SODAR Core based sites
Several subcomponents of the SODAR server such as project and user management have proven to be useful in other contexts. We have extracted them into the SODAR Core library, which forms the foundation of other projects such as VarFish.

SODAR Administration
We provide a straightforward way to install SODAR and related components (SODAR, iRODS, Davrods, and supporting database servers) and to maintain such an installation based on Docker containers and Docker Compose. Detailed installation instructions can be found in the sodar-docker-compose repository linked to in the section "4.2 Source Code Availability".
The whole system can be set up using an external LDAP/ActiveDirectory server for users and credentials or as an alternative in a standalone fashion where SODAR hosts this information. Further, users can of course use their external iRODS installation. Finally, SODAR features administrator dashboards for providing statistics about projects and usage of storage resources.
